A Comparison of Different Approaches to Hierarchical Clustering of Ordinal Data

Authors

  • Aleš Žiberna
  • Nataša Kejžar
  • Petra Golob
Abstract

The paper tries to answer the following question: “How should we treat ordinal data in hierarchical clustering?” The question is strongly connected to the use of questionnaires with ordinal scales in the social sciences. The results could help to distinguish questionnaire items whose answers can be treated as scale variables from those that would be better converted to ranks and those that should be treated as nominal variables. To make the results general, several two-dimensional combinations of group sizes, shapes and differences between group centers were used, as well as one three-dimensional combination. Each combination was simulated both with and without unessential variables. All datasets consisted of 3 groups, each with its own multivariate distribution (2 or 3 variables) with known means and covariances. From each design several datasets were simulated. Each variable was cut and recoded to achieve an ordinal scale. Different cutting schemes were used (intervals of equal size, intervals increasing or decreasing in size from the lowest to the highest value, or intervals decreasing in size from the mean toward both extremes). The recoded variables were then handled in three ways: treated as interval, converted to ranks, or treated as nominal. Hierarchical clustering algorithms were then applied. Ward's algorithm with squared Euclidean distance was used when the data were treated as interval or converted to ranks, and Ward's algorithm with the matching coefficient as the dissimilarity measure was used when they were treated as nominal. The quality of the results was assessed by comparing the obtained partitions with the three original groups. As a baseline, partitions obtained by clustering the original (uncut) data were also compared with the three original groups. The comparison was made using the Corrected Rand Index.
The results indicate that in most cases treating the data as interval or converting them to ranks yields better results than treating them as nominal, but the differences sometimes diminish when the variables are cut into a smaller number of intervals.

Author affiliation: University of Ljubljana, Slovenia
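
The recoding and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the threshold values, the function names, and the choice of "share of mismatching variables" (one minus the simple matching coefficient) as the nominal dissimilarity are assumptions made for illustration. The Ward clustering step itself is omitted; the sketch only shows how a continuous variable might be cut into ordinal categories, how the categories might be converted to average ranks, how the two dissimilarities might be computed, and how two partitions can be compared with the Corrected (adjusted) Rand Index.

```python
from bisect import bisect_right
from collections import Counter
from math import comb


def cut_to_ordinal(values, thresholds):
    """Recode continuous values into ordinal categories 1..k.

    `thresholds` is a sorted list of cut points (hypothetical values;
    the paper's actual cutting schemes are not reproduced here).
    """
    return [bisect_right(thresholds, v) + 1 for v in values]


def to_ranks(codes):
    """Replace each category by its average rank (ties share the mean rank)."""
    counts = Counter(codes)
    rank_of = {}
    start = 0  # number of units in lower categories
    for cat in sorted(counts):
        n = counts[cat]
        rank_of[cat] = start + (n + 1) / 2  # mean of ranks start+1 .. start+n
        start += n
    return [rank_of[c] for c in codes]


def squared_euclidean(x, y):
    """Dissimilarity for data treated as interval or converted to ranks."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def matching_dissimilarity(x, y):
    """Assumed nominal dissimilarity: share of variables on which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def corrected_rand_index(labels_a, labels_b):
    """Corrected (adjusted) Rand Index between two partitions of the same units."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Example: cut four simulated values into three categories, then rank them.
codes = cut_to_ordinal([-1.2, -0.1, 0.3, 2.0], thresholds=[-0.5, 0.5])
ranks = to_ranks(codes)
```

In a full replication, the dissimilarities would be passed to a Ward-type agglomerative procedure, and the resulting three-cluster partition would be compared with the known group membership using `corrected_rand_index`.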

Similar resources

Application of Clustering Methods in DNA Microarrays

Background: DNA microarray technology has paved the way for investigators to measure the expression of thousands of genes in a short time. Analysis of this large amount of raw data includes normalization, clustering and classification. The present study surveys the application of clustering techniques in DNA microarray analysis. Materials and methods: We analyzed data from the study by Van’t Veer et al. dealing with BRCA1...

Evaluating Different Approaches to Permeability Prediction in a Carbonate Reservoir

Permeability can be measured directly in the laboratory using cores taken from the reservoir. Due to the high cost associated with coring, cores are available for only a limited number of wells in a field. Many empirical models, statistical methods, and intelligent techniques have been suggested to predict permeability in un-cored wells from easy-to-obtain and frequent data such as wireline logs. The main obj...

Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories

In recent years, the rapid growth of spatial trajectory data, together with the need to process it and extract useful information and meaningful patterns, has attracted many researchers to the field of spatio-temporal trajectory clustering. The processing and analysis of these trajectories have resulted in the extraction of useful information whic...

A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach

In recent years, the advancement of information gathering technologies such as GPS and GSM networks has led to huge, complex datasets such as time series and trajectories. As a result, it is essential to use appropriate methods to analyze the large raw datasets produced. Extracting useful information from large datasets has always been one of the most important challenges in different sciences,...

A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Accuracy and validity of data are prerequisites for the proper operation of any software system. Errors in data are always possible due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. Only one of them should be in a data source, but for some reasons like aggregation of ...

Publication date: 2004